Two cases of academic fraud recently made headlines in academia. The first involved Francesca Gino, a behavioral scientist and prominent professor at Harvard University whose research, ironically, was about honesty. The second involved Marc Tessier-Lavigne, a neuroscientist and former president of Stanford University, who stepped down following the allegations. While the media focused on the reputational damage suffered by these academics, I wondered how many other papers were affected by the eventual retractions of their work. A search on Google Scholar shows that Gino's most cited retracted paper received 527 citations; the number is 737 for Tessier-Lavigne. The effect does not stop with the researchers who cited these papers directly: it ripples outward as those citing papers are, in turn, cited by subsequent researchers.
I thought this issue was a good opportunity to put some basic network science concepts into context. In the process, I will show you how to write code that calculates and visualizes descriptive quantities from network data. On the programming side, I assume you know how to run code in a Python integrated development environment (IDE) like JupyterLab or Google Colab, and how to perform basic programming tasks in Python (control flow statements, creating functions, manipulating data structures, importing libraries). On the math side, I assume you have basic knowledge of functions and matrices.
A network is a system of interconnected entities. This broad definition allows network science to permeate all disciplines and industries. Here are some examples:
The simplest network is a system of two connected entities. Mathematically, this is represented by two nodes connected by an edge:
A graph refers to the mathematical representation of a network, but you will hear graph and network used interchangeably. A graph can be as simple as this:
and as complex as this:
This is a co-occurrence graph of the book "A Game of Thrones" by George R. R. Martin. Characters are connected if their names appear within 13 words of each other in the novel. The thickness of the link represents the number of times they co-occur. Data obtained from [Kaggle](https://www.kaggle.com/datasets/mmmarchetti/game-of-thrones-dataset) and plotted using [Gephi](https://gephi.org/).
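To make the co-occurrence rule concrete, here is a minimal sketch in plain Python. It is not how the Kaggle dataset was built: the toy sentence, the list of character names, the window of 4 words, and the choice to count a pair whenever the distance is less than or equal to the window are all my own assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy inputs (hypothetical): a short text and the character names to track.
text = "Jon spoke to Arya while Sansa waited and later Arya found Jon again"
characters = {"Jon", "Arya", "Sansa"}
window = 4  # the figure used 13 words; a smaller window suits this toy text

tokens = text.split()
# Positions at which each tracked character name occurs.
mentions = [(i, tok) for i, tok in enumerate(tokens) if tok in characters]

# Count every pair of distinct characters mentioned within `window` words.
weights = Counter()
for (i, a), (j, b) in combinations(mentions, 2):
    if a != b and abs(i - j) <= window:
        weights[tuple(sorted((a, b)))] += 1

for (a, b), w in sorted(weights.items()):
    print(a, b, w)
```

Each key of `weights` is an undirected edge between two characters, and its count plays the role of the link thickness in the figure.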
The examples above touched upon types of networks. Let us make the distinctions explicit:
How do you describe networks? What can be calculated from a graph that translates to real-world insights? Such quantities will be explored in the context of a citation network: a directed, unweighted, and non-bipartite network. In citation networks, a node represents a scientific paper and a directed edge represents a citation from one paper to another. By calculating these quantities, we gain a better understanding of the effects of retraction on the academic publishing system.
This is a citation network. Circles represent research papers and arrows represent citations. For example, the paper by first author Sung is cited by the papers at the tails of the arrows pointing to it, such as Khoury and Wang. Source: [researchgate.net](https://www.researchgate.net/figure/Directed-citation-network-Nodes-represent-papers-in-the-corpus-Directed-edges-represent_fig2_313265540)
The data comes from SNAP Datasets, a collection of large-scale network data stored in text format. We will use the dataset on the High-energy physics theory citation network page, which contains a citation network of theoretical high-energy physics papers from the open-access hub arXiv.
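Before turning to real data, the idea can be sketched with a toy citation network in plain Python. The citations of Sung by Khoury and Wang follow the caption above; the extra paper "Lee" and the helper `downstream` are hypothetical names I introduce only to illustrate how a retraction could ripple through indirect citations.

```python
from collections import defaultdict, deque

# Toy citation edge list (citer, cited). Khoury and Wang citing Sung follows the
# figure caption; the paper "Lee" and its citation of Khoury are hypothetical.
citations = [("Khoury", "Sung"), ("Wang", "Sung"), ("Lee", "Khoury")]

# Reverse adjacency: for each paper, the papers that cite it directly.
cited_by = defaultdict(set)
for citer, cited in citations:
    cited_by[cited].add(citer)

def downstream(paper):
    """Return every paper that cites `paper` directly or through a chain of citations."""
    seen, queue = set(), deque([paper])
    while queue:
        current = queue.popleft()
        for citer in cited_by.get(current, ()):
            if citer not in seen:
                seen.add(citer)
                queue.append(citer)
    return seen

print(len(cited_by["Sung"]))       # number of direct citations of Sung
print(sorted(downstream("Sung")))  # direct and indirect citers of Sung
```

The first number is what a citation count reports; the second set is closer to the ripple effect described at the start, since it also includes papers that cite the citers.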
Screenshot from the High-energy physics theory citation network dataset.
There are three downloadable files in the Files section of the page. We will only use cit-HepTh.txt.gz. To open this compressed file, we use the gzip module from Python's standard library:
import gzip
input_file = r"C:\Users\63926\Desktop\cit-HepTh.txt.gz"
output_file = r"C:\Users\63926\Desktop\cit-HepTh.txt"
with gzip.open(input_file, 'rb') as f_in:
    file_content = f_in.read()
with open(output_file, 'wb') as f_out:
    f_out.write(file_content)
The code saves the contents of the compressed file as cit-HepTh.txt. Make sure you edit the input_file and output_file strings to match the file paths on your device. The text file begins with four lines that describe the dataset, followed by tab-separated numbers (see screenshot below). The first column gives the ID of the citing paper, while the second column gives the ID of the referenced paper. This data structure is called an edge list.
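If you would rather not unpack the file to disk, an edge list like this can also be read straight from the compressed file in text mode. The sketch below is only an illustration under stated assumptions: it writes a tiny stand-in file (the name toy-edges.txt.gz and the header text are made up) so that it runs on its own, then parses it by hand.

```python
import gzip
import os
import tempfile

# Hypothetical stand-in for cit-HepTh.txt.gz: '#' comment lines followed by
# tab-separated "citer<TAB>cited" pairs. The header text here is made up.
sample = "# toy header line\n# citer\tcited\n1001\t9304045\n1001\t9308122\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy-edges.txt.gz")
    with gzip.open(path, "wt") as f_out:   # write the toy file in text mode
        f_out.write(sample)

    edges = []
    with gzip.open(path, "rt") as f_in:    # read it back without unpacking to disk
        for line in f_in:
            if line.startswith("#"):       # skip the descriptive header lines
                continue
            citer, cited = line.split()
            edges.append((int(citer), int(cited)))

print(edges)
```

Each tuple in `edges` is one row of the edge list: the citing paper first, the referenced paper second.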
Screenshot of the contents of the text file.
Let us put the data in a NumPy array. For first-time users, NumPy is a Python library for manipulating arrays and matrices. You may install it using
!pip install numpy
Requirement already satisfied: numpy in c:\users\63926\anaconda3\lib\site-packages (1.23.5)
We then import the library and load the data using the np.loadtxt function. The argument comments='#' tells the function to skip the lines that begin with '#', which are the descriptive lines at the top of the text file, leaving only the ID values.
import numpy as np
edge_list = np.loadtxt(gzip.open(input_file, 'rb'), dtype=int, comments='#')
edge_list
array([[   1001, 9304045],
       [   1001, 9308122],
       [   1001, 9309097],
       ...,
       [9912286, 9808140],
       [9912286, 9810068],
       [9912286, 9901023]])
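A quick sanity check on an array like this is to count its rows and its distinct IDs. Here is a minimal sketch on a toy array with the same two-column layout; the first three rows are copied from the output above, while the last row and the variable names are hypothetical.

```python
import numpy as np

# Toy edge list with the same two-column layout as `edge_list` above;
# the first three rows come from the output shown, the last one is hypothetical.
toy_edges = np.array([[1001, 9304045],
                      [1001, 9308122],
                      [1001, 9309097],
                      [9304045, 9308122]])

n_edges = toy_edges.shape[0]           # one row per citation
citers = toy_edges[:, 0]               # first column: citing papers
cited = toy_edges[:, 1]                # second column: referenced papers
n_papers = np.unique(toy_edges).size   # distinct IDs across both columns

print(n_edges, n_papers)
```

Applied to the full `edge_list`, the same two quantities should match the number of edges and nodes of the graph built from it.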
The file cit-HepTh-dates.txt.gz contains the paper submission dates. This data is useful for more advanced explorations involving time dynamics. The file cit-HepTh-abstracts.tar.gz contains the abstracts of the papers, and could serve as an entry point for natural language processing and topic modeling.
We need two additional Python libraries: one to compute descriptive quantities and another to visualize them. For these tasks, we will install and import the NetworkX and Matplotlib libraries:
!pip install networkx
!pip install matplotlib
Requirement already satisfied: networkx in c:\users\63926\anaconda3\lib\site-packages (2.8.4)
Requirement already satisfied: matplotlib in c:\users\63926\anaconda3\lib\site-packages (3.7.0)
Requirement already satisfied: numpy>=1.20 in c:\users\63926\anaconda3\lib\site-packages (from matplotlib) (1.23.5)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\63926\anaconda3\lib\site-packages (from matplotlib) (4.25.0)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\63926\anaconda3\lib\site-packages (from matplotlib) (1.0.5)
Requirement already satisfied: pillow>=6.2.0 in c:\users\63926\anaconda3\lib\site-packages (from matplotlib) (9.4.0)
Requirement already satisfied: packaging>=20.0 in c:\users\63926\anaconda3\lib\site-packages (from matplotlib) (23.1)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\63926\anaconda3\lib\site-packages (from matplotlib) (1.4.4)
Requirement already satisfied: cycler>=0.10 in c:\users\63926\anaconda3\lib\site-packages (from matplotlib) (0.11.0)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\63926\anaconda3\lib\site-packages (from matplotlib) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\63926\anaconda3\lib\site-packages (from matplotlib) (2.8.2)
Requirement already satisfied: six>=1.5 in c:\users\63926\anaconda3\lib\site-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)
import networkx as nx
from matplotlib import pyplot as plt
To transform the edge list into a NetworkX graph object, I called nx.DiGraph to create an empty directed graph. Iterating over the entries of the edge list, I added each entry to the graph with the DiGraph.add_edge method. Despite its name, this method also adds the two endpoint nodes to the graph if they are not already there.
G = nx.DiGraph()
for from_node, to_node in edge_list:
    G.add_edge(from_node, to_node)
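As an alternative to the explicit loop, NetworkX also provides DiGraph.add_edges_from, which takes a whole collection of edges at once. The sketch below uses a toy edge list (IDs copied from the array output earlier, and a hypothetical graph name G_toy) to show that nodes are created along with the edges and that direction matters.

```python
import networkx as nx

# Toy edge list (citer, cited); the IDs are taken from the output shown earlier.
toy_edges = [(1001, 9304045), (1001, 9308122), (1001, 9309097)]

G_toy = nx.DiGraph()
G_toy.add_edges_from(toy_edges)   # adds all edges, creating missing nodes on the way

print(sorted(G_toy.nodes()))
print(G_toy.has_edge(1001, 9304045))   # edge from the citer to the cited paper
print(G_toy.has_edge(9304045, 1001))   # the reverse edge was never added
```

For the full dataset, something like G.add_edges_from(edge_list.tolist()) should build the same graph, with plain Python integers as node labels.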
The order and size of a network refer to the number of nodes and edges, respectively. The DiGraph.order and DiGraph.size methods reveal 27,770 research papers and 352,807 citations in the citation network, which puts it in the realm of large-scale networks.
N = G.order()
L = G.size()
print('Number of nodes:', N)
print('Number of edges:', L)
Number of nodes: 27770
Number of edges: 352807
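These two numbers already yield a first descriptive quantity: the average number of citations per paper. This is a small worked example rather than part of the original analysis; the name mean_degree is my own.

```python
# Node and edge counts reported above for the citation network.
N = 27770
L = 352807

# Every directed edge leaves one paper and enters another, so the mean
# in-degree and the mean out-degree of the graph are both equal to L / N.
mean_degree = L / N

print(round(mean_degree, 2))
```

In other words, a paper in this dataset cites, and is cited by, a little under 13 other papers on average.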
NetworkX is not equipped to handle the visualization of large networks. To see the citation graph, we use the visualization tool Gephi. I will not discuss how to use the program. I recommend this article for a tutorial.
My mid-spec laptop could not properly render the visuals for all the nodes, so I filtered them to include only those papers that received 100 citations or more. The size and hue of the circles denote the number of citations. It is difficult to extract insights from such a convoluted graph. This is why descriptive quantities are important: they give us numerical bases for qualitative characteristics.
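The same kind of filter can be expressed in NetworkX, since the number of citations a paper received is its in-degree. This is only one possible way to do it, and I am not claiming the figure was produced this way: the toy graph, the threshold of 2, and the names G_toy, highly_cited, and H are assumptions for illustration.

```python
import networkx as nx

# Toy citation graph; edges point from the citing paper to the cited paper.
G_toy = nx.DiGraph([("B", "A"), ("C", "A"), ("D", "A"),
                    ("C", "B"), ("D", "B"), ("D", "C")])

threshold = 2  # hypothetical cutoff; the figure used 100 citations

# In a citation graph, the in-degree of a node is the number of citations it received.
highly_cited = [node for node, k_in in G_toy.in_degree() if k_in >= threshold]

# Induced subgraph: the selected papers plus the citations among them.
H = G_toy.subgraph(highly_cited).copy()

print(sorted(H.nodes()))
print(sorted(H.edges()))
```

A subgraph like H could then be saved with nx.write_gexf and opened in Gephi, which keeps the rendering load manageable.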